95 research outputs found

    Promoter prediction in E. coli based on SIDD profiles and Artificial Neural Networks

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>One of the major challenges in biology is the correct identification of promoter regions. Computational methods based on motif searching have been the traditional approach taken. Recent studies have shown that DNA structural properties, such as curvature, stacking energy, and stress-induced duplex destabilization (SIDD) are useful in promoter prediction, as well. In this paper, the currently used SIDD energy threshold method is compared to the proposed artificial neural network (ANN) approach for finding promoters based on SIDD profile data.</p> <p>Results</p> <p>When compared to the SIDD threshold prediction method, artificial neural networks showed noticeable improvements for precision, recall, and <it>F</it>-score over a range of values. The maximal <it>F</it>-score for the ANN classifier was 62.3 and 56.8 for the threshold-based classifier.</p> <p>Conclusions</p> <p>Artificial neural networks were used to predict promoters based on SIDD profile data. Results using this technique were an improvement over the previous SIDD threshold approach. Over a wide range of precision-recall values, artificial neural networks were more capable of identifying distinctive characteristics of promoter regions than threshold based methods.</p

    Analysis of computational approaches for motif discovery

    Get PDF
    Recently, we performed an assessment of 13 popular computational tools for discovery of transcription factor binding sites (M. Tompa, N. Li, et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites", Nature Biotechnology, Jan. 2005). This paper contains follow-up analysis of the assessment results, and raises and discusses some important issues concerning the state of the art in motif discovery methods: 1. We categorize the objective functions used by existing tools, and design experiments to evaluate whether any of these objective functions is the right one to optimize. 2. We examine various features of the data sets that were used in the assessment, such as sequence length and motif degeneracy, and identify which features make data sets hard for current motif discovery tools. 3. We identify an important feature that has not yet been used by existing tools and propose a new objective function that incorporates this feature

    Efficient and accurate P-value computation for Position Weight Matrices

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.</p> <p>Results</p> <p>The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available.</p> <p>Conclusion</p> <p>We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.</p

    SEARCHPATTOOL: a new method for mining the most specific frequent patterns for binding sites with application to prokaryotic DNA sequences

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Computational methods to predict transcription factor binding sites (TFBS) based on exhaustive algorithms are guaranteed to find the best patterns but are often limited to short ones or impose some constraints on the pattern type. Many patterns for binding sites in prokaryotic species are not well characterized but are known to be large, between 16–30 base pairs (bp) and contain at least 2 conserved bases. The length of prokaryotic species promoters (about 400 bp) and our interest in studying a small set of genes that could be a cluster of co-regulated genes from microarray experiments led to the development of a new exhaustive algorithm targeting these large patterns.</p> <p>Results</p> <p>We present Searchpattool, a new method to search for and select the most specific (conservative) frequent patterns. This method does not impose restrictions on the density or the structure of the pattern. The best patterns (motifs) are selected using several statistics, including a new application of a z-score based on the number of matching sequences. We compared Searchpattool against other well known algorithms on a <it>Bacillus subtilis </it>group of 14 input sequences and found that in our experiments Searchpattool always performed the best based on performance scores.</p> <p>Conclusion</p> <p>Searchpattool is a new method for pattern discovery relative to transcription factor binding sites for species or genes with short promoters. It outputs the most specific significant patterns and helps the biologist to choose the best candidates.</p

    Analysis of Gene Regulatory Networks in the Mammalian Circadian Rhythm

    Get PDF
    Circadian rhythm is fundamental in regulating a wide range of cellular, metabolic, physiological, and behavioral activities in mammals. Although a small number of key circadian genes have been identified through extensive molecular and genetic studies in the past, the existence of other key circadian genes and how they drive the genomewide circadian oscillation of gene expression in different tissues still remains unknown. Here we try to address these questions by integrating all available circadian microarray data in mammals. We identified 41 common circadian genes that showed circadian oscillation in a wide range of mouse tissues with a remarkable consistency of circadian phases across tissues. Comparisons across mouse, rat, rhesus macaque, and human showed that the circadian phases of known key circadian genes were delayed for 4–5 hours in rat compared to mouse and 8–12 hours in macaque and human compared to mouse. A systematic gene regulatory network for the mouse circadian rhythm was constructed after incorporating promoter analysis and transcription factor knockout or mutant microarray data. We observed the significant association of cis-regulatory elements: EBOX, DBOX, RRE, and HSE with the different phases of circadian oscillating genes. The analysis of the network structure revealed the paths through which light, food, and heat can entrain the circadian clock and identified that NR3C1 and FKBP/HSP90 complexes are central to the control of circadian genes through diverse environmental signals. Our study improves our understanding of the structure, design principle, and evolution of gene regulatory networks involved in the mammalian circadian rhythm

    Wide-Scale Analysis of Human Functional Transcription Factor Binding Reveals a Strong Bias towards the Transcription Start Site

    Get PDF
    We introduce a novel method to screen the promoters of a set of genes with shared biological function, against a precompiled library of motifs, and find those motifs which are statistically over-represented in the gene set. The gene sets were obtained from the functional Gene Ontology (GO) classification; for each set and motif we optimized the sequence similarity score threshold, independently for every location window (measured with respect to the TSS), taking into account the location dependent nucleotide heterogeneity along the promoters of the target genes. We performed a high throughput analysis, searching the promoters (from 200bp downstream to 1000bp upstream the TSS), of more than 8000 human and 23,000 mouse genes, for 134 functional Gene Ontology classes and for 412 known DNA motifs. When combined with binding site and location conservation between human and mouse, the method identifies with high probability functional binding sites that regulate groups of biologically related genes. We found many location-sensitive functional binding events and showed that they clustered close to the TSS. Our method and findings were put to several experimental tests. By allowing a "flexible" threshold and combining our functional class and location specific search method with conservation between human and mouse, we are able to identify reliably functional TF binding sites. This is an essential step towards constructing regulatory networks and elucidating the design principles that govern transcriptional regulation of expression. The promoter region proximal to the TSS appears to be of central importance for regulation of transcription in human and mouse, just as it is in bacteria and yeast.Comment: 31 pages, including Supplementary Information and figure

    The value of position-specific priors in motif discovery using MEME

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Position-specific priors have been shown to be a flexible and elegant way to extend the power of Gibbs sampler-based motif discovery algorithms. Information of many types–including sequence conservation, nucleosome positioning, and negative examples–can be converted into a prior over the location of motif sites, which then guides the sequence motif discovery algorithm. This approach has been shown to confer many of the benefits of conservation-based and discriminative motif discovery approaches on Gibbs sampler-based motif discovery methods, but has not previously been studied with methods based on expectation maximization (EM).</p> <p>Results</p> <p>We extend the popular EM-based MEME algorithm to utilize position-specific priors and demonstrate their effectiveness for discovering transcription factor (TF) motifs in yeast and mouse DNA sequences. Utilizing a discriminative, conservation-based prior dramatically improves MEME's ability to discover motifs in 156 yeast TF ChIP-chip datasets, more than doubling the number of datasets where it finds the correct motif. On these datasets, MEME using the prior has a higher success rate than eight other conservation-based motif discovery approaches. We also show that the same type of prior improves the accuracy of motifs discovered by MEME in mouse TF ChIP-seq data, and that the motifs tend to be of slightly higher quality those found by a Gibbs sampling algorithm using the same prior.</p> <p>Conclusions</p> <p>We conclude that using position-specific priors can substantially increase the power of EM-based motif discovery algorithms such as MEME algorithm.</p
    corecore